 navigation plan


WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks

arXiv.org Artificial Intelligence

Large-language-model (LLM) agents are becoming competent at straightforward web tasks, such as opening an item page or submitting a form, but still struggle with objectives that require long-horizon navigation, large-scale information extraction, and reasoning under constraints. WebDART is a general framework that enables a single LLM to handle such complex chores. It (i) dynamically decomposes each objective into three focused sub-tasks--navigation, information extraction, and execution--so the model concentrates on one skill at a time, and (ii) continuously re-plans the decomposition as new webpages are revealed, taking advantage of newly discovered filters or shortcuts and avoiding redundant exploration.

LLM-powered web agents have recently shown promising abilities in web navigation tasks (Drouin et al., 2024; He et al., 2024; Wei et al., 2025; Yang et al., 2024a; Pan et al., 2024; Song et al., 2024). Benchmarks such as WebArena (Zhou et al., 2023) demonstrate that these agents achieve reasonable accuracy on simple objectives, highlighting their potential as general-purpose automation tools. However, when the objectives require more complex reasoning and multi-step exploration, the performance of these agents often collapses. As shown in Figure 1, on WebChoreArena (Miyai et al., 2025), a benchmark designed to test higher-complexity web tasks, agents powered by GPT-4o achieve only 8.0% accuracy on tasks across different web domains, far below the 46.6% accuracy on WebArena. This gap highlights a critical weakness of current workflows: while sufficient for simple goals, they are not well equipped for tasks that demand multi-step reasoning, long-horizon navigation, and structured information processing. A closer examination reveals that the difficulty arises from cognitive overload. Complex tasks require agents to simultaneously navigate across multiple web pages, extract and track large amounts of information, and reason under constraints.

Consider the following task from WebChoreArena (Miyai et al., 2025): "Tell me the top 3 products with the highest number of reviews in Home Audio of Electronics within the price range of $1,000 to $9,999". As illustrated in Figure 1, product information is distributed across multiple nested web pages. Each page may contain tens of products with attributes such as price and number of reviews.
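
To make the reasoning part of this example task concrete, the following is a minimal sketch of what remains once the products have already been extracted from every page: filter by price range, then keep the three entries with the most reviews. The `Product` class, the `top_reviewed` helper, and all product names and numbers are hypothetical illustrations, not data from the benchmark and not WebDART's implementation.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical record for one extracted listing.
    name: str
    price: float
    reviews: int


def top_reviewed(products, low, high, k=3):
    """Return the k products with the most reviews whose price is in [low, high]."""
    in_range = [p for p in products if low <= p.price <= high]
    return heapq.nlargest(k, in_range, key=lambda p: p.reviews)


# Made-up listings, grouped by the page they were (hypothetically) found on.
pages = [
    [Product("Speaker A", 1200.0, 12), Product("Soundbar B", 450.0, 90)],
    [Product("Receiver C", 2500.0, 40), Product("Amplifier D", 9999.0, 7)],
    [Product("Turntable E", 1000.0, 25), Product("Subwoofer F", 15000.0, 300)],
]

# Merge the per-page results before ranking, since the answer spans pages.
all_products = [p for page in pages for p in page]
top3 = top_reviewed(all_products, 1000, 9999)
names = [p.name for p in top3]
print(names)
```

The ranking itself is trivial; the difficulty the passage describes lies in collecting `pages` correctly across many nested webpages while keeping track of the constraints.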


Vision Language Models Can Parse Floor Plan Maps

arXiv.org Artificial Intelligence

Vision language models (VLMs) can simultaneously reason about images and texts to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored within the VLM context and particularly useful to mobile robots. Map parsing requires understanding not only the labels but also the geometric configurations of a map, i.e., what areas are like and how they are connected. To evaluate the performance of VLMs on map parsing, we prompt VLMs with floorplan maps to generate task plans for complex indoor navigation. Our results demonstrate the remarkable capability of VLMs in map parsing, with a success rate of 0.96 in tasks requiring a sequence of nine navigation actions, e.g., approaching and going through doors. Beyond intuitive observations, e.g., that VLMs do better on smaller maps and simpler navigation tasks, we also observed that their performance drops in large open areas. We provide practical suggestions to address such challenges as validated by our experimental results. Webpage: https://shorturl.at/OUkEY


RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation

arXiv.org Artificial Intelligence

Mapping is crucial for spatial reasoning, planning and robot navigation. Existing approaches range from metric, which require precise geometry-based optimization, to purely topological, where image-as-node based graphs lack explicit object-level reasoning and interconnectivity. In this paper, we propose a novel topological representation of an environment based on "image segments", which are semantically meaningful and open-vocabulary queryable, conferring several advantages over previous works based on pixel-level features. Unlike 3D scene graphs, we create a purely topological graph with segments as nodes, where edges are formed by a) associating segment-level descriptors between pairs of consecutive images and b) connecting neighboring segments within an image using their pixel centroids. This unveils a "continuous sense of a place", defined by inter-image persistence of segments along with their intra-image neighbours. It further enables us to represent and update segment-level descriptors through neighborhood aggregation using graph convolution layers, which improves robot localization based on segment-level retrieval. Using real-world data, we show how our proposed map representation can be used to i) generate navigation plans in the form of "hops over segments" and ii) search for target objects using natural language queries describing spatial relations of objects. Furthermore, we quantitatively analyze data association at the segment level, which underpins inter-image connectivity during mapping and segment-level localization when revisiting the same place. Finally, we show preliminary trials on segment-level "hopping" based zero-shot real-world navigation. Project page with supplementary details: oravus.github.io/RoboHop/


Learning-based Preference Prediction for Constrained Multi-Criteria Path-Planning

arXiv.org Artificial Intelligence

Learning-based methods are increasingly popular for search algorithms in single-criterion optimization problems. In contrast, for multiple-criteria optimization there are significantly fewer approaches despite the existence of numerous applications. Constrained path-planning for Autonomous Ground Vehicles (AGV) is one such application, where an AGV is typically deployed in disaster relief or search and rescue applications in off-road environments. The agent can be faced with the following dilemma: optimize a source-destination path according to a known criterion and an uncertain criterion under operational constraints. The known criterion is associated with the cost of the path, representing the distance. The uncertain criterion represents the feasibility of driving through the path without requiring human intervention. It depends on various external parameters such as the physics of the vehicle, the state of the explored terrains or weather conditions. In this work, we leverage knowledge acquired through offline simulations by training a neural network model to predict the uncertain criterion. We integrate this model inside a path-planner which can solve problems online. Finally, we conduct experiments on realistic AGV scenarios which illustrate that the proposed framework requires human intervention less frequently, trading for a limited increase in the path distance.


Topological Planning with Transformers for Vision-and-Language Navigation

arXiv.org Artificial Intelligence

Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.


Tradeoff-Focused Contrastive Explanation for MDP Planning

arXiv.org Artificial Intelligence

End-users' trust in automated agents is important as automated decision-making and planning is increasingly used in many aspects of people's lives. In real-world applications of planning, multiple optimization objectives are often involved. Thus, planning agents' decisions can involve complex tradeoffs among competing objectives. It can be difficult for the end-users to understand why an agent decides on a particular planning solution on the basis of its objective values. As a result, the users may not know whether the agent is making the right decisions, and may lack trust in it. In this work, we contribute an approach, based on contrastive explanation, that enables a multi-objective MDP planning agent to explain its decisions in a way that communicates its tradeoff rationale in terms of the domain-level concepts. We conduct a human subjects experiment to evaluate the effectiveness of our explanation approach in a mobile robot navigation domain. The results show that our approach significantly improves the users' understanding, and confidence in their understanding, of the tradeoff rationale of the planning agent.


Translating Navigation Instructions in Natural Language to a High-Level Plan for Behavioral Robot Navigation

arXiv.org Artificial Intelligence

We propose an end-to-end deep learning model for translating free-form natural language instructions to a high-level plan for behavioral robot navigation. The proposed model uses attention mechanisms to connect information from user instructions with a topological representation of the environment. To evaluate this model, we collected a new dataset for the translation problem containing 11,051 pairs of user instructions and navigation plans. Our results show that the proposed model outperforms baseline approaches on the new dataset. Overall, our work suggests that a topological map of the environment can serve as a relevant knowledge base for translating natural language instructions into a sequence of navigation behaviors.


High Level Path Planning with Uncertainty

arXiv.org Artificial Intelligence

For high level path planning, environments are usually modeled as distance graphs, and path planning problems are reduced to computing the shortest path in distance graphs. One major drawback of this modeling is the inability to model uncertainties, which are often encountered in practice. In this paper, a new tool, called U-graph, is proposed for environment modeling. A U-graph is an extension of distance graphs with the ability to handle a kind of uncertainty. By modeling an uncertain environment as a U-graph, and a navigation problem as a Markovian decision process, we can precisely define a new optimality criterion for navigation plans, and more importantly, we can come up with a general algorithm for computing optimal plans for navigation tasks.
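
As a point of reference for the baseline this abstract starts from, the sketch below computes a shortest path in an ordinary distance graph with Dijkstra's algorithm. It illustrates only the conventional deterministic model, not the U-graph extension or the paper's algorithm. The `shortest_path` function, the adjacency-dict layout, and the room names and distances are assumptions made for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra on a distance graph given as {node: {neighbor: distance}}.

    Returns (total_distance, path); (inf, []) if target is unreachable.
    Assumes all distances are non-negative.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical indoor environment; edge weights are distances in meters.
graph = {
    "lobby": {"hall": 4.0, "stairs": 9.0},
    "hall": {"stairs": 3.0, "lab": 10.0},
    "stairs": {"lab": 2.0},
    "lab": {},
}
cost, path = shortest_path(graph, "lobby", "lab")
print(cost, path)
```

Every edge here has a single fixed distance, which is exactly the limitation the abstract points to: such a graph has no way to express that a route may or may not be usable in practice.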


Learning to Interpret Natural Language Navigation Instructions from Observations

AAAI Conferences

The ability to understand natural-language instructions is critical to building intelligent agents that interact with humans. We present a system that learns to transform natural-language navigation instructions into executable formal plans. Given no prior linguistic knowledge, the system learns by simply observing how humans follow navigation instructions. The system is evaluated in three complex virtual indoor environments with numerous objects and landmarks. A previously collected realistic corpus of complex English navigation instructions for these environments is used for training and testing data. By using a learned lexicon to refine inferred plans and a supervised learner to induce a semantic parser, the system is able to automatically learn to correctly interpret a reasonable fraction of the complex instructions in this corpus.


Probabilistic Hybrid Action Models for Predicting Concurrent Percept-driven Robot Behavior

Journal of Artificial Intelligence Research

Most autonomous robots are equipped with restricted, unreliable, and inaccurate sensors and effectors and operate in complex and dynamic environments. A successful approach to deal with the resulting uncertainty is the use of controllers that prescribe the robots' behavior in terms of concurrent reactive plans (CRPs) -- plans that specify how the robots are to react to sensory input in order to accomplish their jobs reliably (e.g., McDermott, 1992a; Beetz, 1999). Reactive plans are successfully used to produce situation specific behavior, to detect problems and recover from them automatically, and to recognize and exploit opportunities (Beetz et al., 2001). These kinds of behaviors are particularly important for autonomous robots that have only uncertain information about the world, act in dynamically changing environments, and are to accomplish complex tasks efficiently. Besides reliability and flexibility, foresight is another important capability of competent autonomous robots (McDermott, 1992a).